Instructions

In this week’s lab, the main goal is to gain some experience in building models to explore and explain data. We will start with the famous gapminder data, and use regression models to study temporal trends in life expectancy across the globe. Then we will use decision trees to build a spam filter, using data collected by Dr Cook and her students a number of years ago.

Warmups

  • Watch the movie of Hans Rosling’s TED talk
  • Email headers have a lot of information about where the email originated from, where the reply goes to, size, domains, … This is usually hidden by your mail handling software, unless you specifically request to view it. See if you can get your mail handler to show you the full header of an email. For example, here is a sample from an email to me from David Frazier:
Received: by 10.103.136.194 with HTTP; Sun, 17 Sep 2017 14:57:50 -0700 (PDT)
In-Reply-To: <CAFvWOFKt9C-WYAWi0-QfA_0x+ej=5DSLsPoPY4NVh29Y=sDf8w@mail.gmail.com>
References: <6A89C7A8-CA54-42BE-938F-CF41CCE2F362@monash.edu> <CAFvWOFKt9C-WYAWi0-QfA_0x+ej=5DSLsPoPY4NVh29Y=sDf8w@mail.gmail.com>
From: David Frazier <david.frazier@monash.edu>
Date: Mon, 18 Sep 2017 07:57:50 +1000
Message-ID: <CAFvWOF+i6U=tFsb2v+2yQ1L91zXXcusSLKBe=XJwHXdY-7JZJQ@mail.gmail.com>
Subject: Re: formula sheet
To: Dianne Cook <dicook@monash.edu>
Content-Type: multipart/mixed; boundary="001a114fcb3aabdccf055969b77e"

--001a114fcb3aabdccf055969b77e
Content-Type: multipart/alternative; boundary="001a114fcb3aabdccd055969b77c"

--001a114fcb3aabdccd055969b77c
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Hi There,

...

Cheers,
David

Exercise 1

Open your project for this class. Make sure all your work is done relative to this project.

Open the lab10.Rmd file provided with the instructions. You can edit this file and add your answers to questions in this document.

Exercise 2

The data has demographics of life expectancy and GDP per capita for 142 countries reported every 5 years between 1952 and 2007.

Observations: 1,704
Variables: 6
$ country   <fctr> Afghanistan, Afghanistan, Afghanistan, Afghanistan,...
$ continent <fctr> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asi...
$ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992...
$ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.8...
$ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 1488...
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 78...
  1. (2pts) How would you describe the following plot? The plot of all the countries is really hard to explain; it's very messy. There is generally an increasing trend in the lines. Some lines have big drops.

  1. The data start in 1952, so for model fitting we shift year to count from 1950 (year1950 = year - 1950). This makes the intercept easier to interpret: it is the fitted life expectancy in 1950.

  2. Then let’s fit a model for Australia.

# A tibble: 6 x 7
    country continent  year lifeExp      pop gdpPercap year1950
     <fctr>    <fctr> <int>   <dbl>    <int>     <dbl>    <dbl>
1 Australia   Oceania  1952   69.12  8691212  10039.60        2
2 Australia   Oceania  1957   70.33  9712569  10949.65        7
3 Australia   Oceania  1962   70.93 10794968  12217.23       12
4 Australia   Oceania  1967   71.10 11872264  14526.12       17
5 Australia   Oceania  1972   71.93 13177000  16788.63       22
6 Australia   Oceania  1977   73.49 14074100  18334.20       27


Call:
lm(formula = lifeExp ~ year1950, data = oz)

Coefficients:
(Intercept)     year1950  
    67.9451       0.2277  
  1. (2pts) Interpret the model. (This means explain how life expectancy changes over years, since 1950, using the parameter estimates of the model.) From a life expectancy of 67.9 in 1950, it has increased by about 0.23 years per year, that is, about 2.3 years for every decade.

  2. (1pt) What was the average life expectancy in 1950? 67.9

  3. (1pt) What was the average life expectancy in 2000? 79.3

  4. (1pt) By how much did average life expectancy change over those 50 years? About 11.4 years (50 × 0.2277)

  5. We can get various diagnostics out for the model with the broom package: the parameter estimates and their significance, the goodness of fit statistics, and model diagnostics.

  6. (1pt) What column of the diagnostics contains the (a) fitted values, (b) residuals? .fitted and .resid respectively.
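
The answers to questions 1 to 4 all follow from the two coefficients in the printout. Here is a minimal base-R sketch of that arithmetic; the helper name predict_life_exp is made up for illustration and is not part of the lab code.

```r
# Coefficients copied from the lm() printout for Australia
intercept <- 67.9451
slope <- 0.2277   # years of life expectancy gained per calendar year

# Fitted life expectancy for a calendar year, using year1950 = year - 1950
# (hypothetical helper, for illustration only)
predict_life_exp <- function(year) intercept + slope * (year - 1950)

le_1950 <- predict_life_exp(1950)
le_2000 <- predict_life_exp(2000)

round(c(in_1950 = le_1950,
        in_2000 = le_2000,
        change_50_years = le_2000 - le_1950,
        per_decade = slope * 10), 1)
```

The fitted values are model averages, not observed data: gapminder has no 1950 or 2000 records for Australia.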

Exercise 3

Now we are going to fit a simple linear model separately to every country, and use the model fits to simplify the patterns across the globe, in order to be able to explain the changes in life expectancy.

This code will compute the models for you:

country      continent  intercept  year1950
Afghanistan  Asia       29.35664   0.2753287
Albania      Europe     58.55976   0.3346832
Algeria      Africa     42.23641   0.5692797
Angola       Africa     31.70797   0.2093399
Argentina    Americas   62.22502   0.2317084
Australia    Oceania    67.94507   0.2277238
# A tibble: 1 x 4
    country continent intercept  year1950
     <fctr>    <fctr>     <dbl>     <dbl>
1 Australia   Oceania  67.94507 0.2277238

It is also possible to use a for loop to compute the slope and intercept for each country.
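
A minimal sketch of the for-loop approach, in base R. To stay self-contained it uses only the rows printed earlier in this lab (1952 to 1977 for Afghanistan and Australia), so the estimates will differ from the full 1952 to 2007 fits in country_coefs. The object names toy and country_coefs_toy are made up for illustration.

```r
# Subset of gapminder copied from the printouts above (1952-1977 only)
toy <- data.frame(
  country = rep(c("Afghanistan", "Australia"), each = 6),
  year = rep(seq(1952, 1977, by = 5), times = 2),
  lifeExp = c(28.801, 30.332, 31.997, 34.020, 36.088, 38.438,
              69.12, 70.33, 70.93, 71.10, 71.93, 73.49)
)
toy$year1950 <- toy$year - 1950

# One row of results per country, filled in by the loop
countries <- unique(toy$country)
country_coefs_toy <- data.frame(
  country = countries,
  intercept = NA_real_,
  slope = NA_real_
)

for (i in seq_along(countries)) {
  # Fit the same simple linear model to one country's rows
  fit <- lm(lifeExp ~ year1950, data = toy[toy$country == countries[i], ])
  country_coefs_toy$intercept[i] <- coef(fit)[["(Intercept)"]]
  country_coefs_toy$slope[i] <- coef(fit)[["year1950"]]
}

country_coefs_toy
```

With the full gapminder data the loop body stays the same; only the data frame changes.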

  1. (2pts) Pick your favorite country (other than Australia). Find the parameter estimates from the country_coefs data frame. Do a hand-sketch of the fitted model. Various answers, but should show the correct intercept and slope with axes marked and labels.

Exercise 4

  1. (2pts) Make a scatterplot of the linear model estimates for each country, slope vs intercept. Colour the points by continent. Make the plot interactive using the plotly package, and find out which countries had a negative slope. Rwanda, Zambia, Zimbabwe

  1. (2pts) Statistically summarise the relationship between intercept and slope, using words like no association, positive linear association, negative linear association, weak, moderate, strong, outliers, clusters. The association is negative, moderate and linear, with some clustering by continent.

  2. (2pts) Do you see a difference between continents? If so, explain what you see. Africa shows the lowest intercepts and most variation in the slope. Europe is high on intercept and low on slope. Asia, like the Americas, is varied on intercept but relatively high on slope.

  3. (2pts) What does it mean for a country to have a high intercept, e.g. 70? The life expectancy in 1950 was quite high, e.g. 70 years.

  4. (2pts) What does it mean for a country to have a high slope, e.g. 0.7? The country had a dramatic increase in life expectancy over the years. A value of 0.7 means life expectancy increased by 7 years for every decade.

Exercise 5

Now we are going to examine the fit for each country. We might expect that a linear model is a better fit for some countries and not so good for other countries. Here is the code to extract the model diagnostics for each country’s model.

Or you can use a for loop to compute this.
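
Whichever way the diagnostics are extracted, the \(R^2\) for a single model is one minus the ratio of the residual sum of squares to the total sum of squares. The base-R sketch below checks this by hand for one small fit. It uses only the six Australian rows printed earlier (1952 to 1977), so the value is not the \(R^2\) of the full Australian model; oz_toy is a made-up name.

```r
# First six Australian rows printed earlier in this lab (1952-1977 only)
oz_toy <- data.frame(
  year1950 = c(2, 7, 12, 17, 22, 27),
  lifeExp = c(69.12, 70.33, 70.93, 71.10, 71.93, 73.49)
)
fit <- lm(lifeExp ~ year1950, data = oz_toy)

# R^2 by hand: 1 - (residual sum of squares) / (total sum of squares)
rss <- sum(residuals(fit)^2)
tss <- sum((oz_toy$lifeExp - mean(oz_toy$lifeExp))^2)
r2_by_hand <- 1 - rss / tss

# The same quantity as reported by summary()
r2_summary <- summary(fit)$r.squared

c(by_hand = r2_by_hand, summary = r2_summary)
```

Inside a loop over countries, summary(fit)$r.squared can be stored in the same way as the intercept and slope.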

  1. Plot the \(R^2\) values as a histogram.

  1. (2pts) Examine the countries with the worst fit, countries with \(R^2<0.45\), by making scatterplots of the data, with the linear model overlaid.

Each of these countries has a big dip in their life expectancy during the time of the study. Explain these using world history and current affairs information. (Feel free to google for news stories.) Civil wars and AIDS

Exercise 6

The file SPAM-503.csv contains summaries of a week’s worth of emails from 19 people. Each email was manually labelled as spam or not. We decided to examine our emails because the university had recently changed its spam filters, and the emails from the university president were being sent to spam. Spam filters have improved dramatically in the last decade, but something happened to Monash mail this past week: emails from Monash students and departmental colleagues have been discovered in the spam folder. Here is a description of the variables:

1. ISUid: ISU id 

2. id: e-mail id (some count from 1 to number of mails you got, so that
you can get back to the original message for the line of data -
to help with checking for strange results.)

3. Day of Week: Sun, Mon, Tue, Wed, Thu, Fri, Sat

4. Time of Day: 0-23 (only integer values)

5. Size [kb]: Size of e-mail in kilobytes

6. Box: Is sender in any of my Inboxes or Outbox (ie known to you) 
yes, no

7. Domain: Domain name of sender's e-mail address (only last segment): .edu,
.net, .com, .org, .gov, .mil, .de, .fr, .ru,

8. Local: Sender's e-mail is in local domain i.e. xx@yy.iastate.edu 
yes, no

9. Digits: Number of digits (0-9) in the
sender's name: e.g. lottery2003@yahoo.com will be 4.

10. name: Name field is a single word or empty: 
e.g. "Andreas Buja <andreas@research.att.com>" is name
 "bob <lottery2003@yahoo.com>" is single 
 "<lottery2003@yahoo.com>" is empty

11. %capital: % capital letters in subject line

12. NSpecial: Number of special characters (i.e. non a-z, A-Z or 0-9) in subject

Spam words in subject line:

13. credit: mortgage, sale, approve, credit -> yes/no
14. sucker: earn, free, save -> yes/no
15. porn: nude, sex, enlarge, improve -> yes/no
16. chain: pass, forward, help -> yes/no
17. username: Is your username/name listed in subject line -> yes/no

18. Large text in e-mail 
yes, no (only yes, if html e-mail and size="+3" or size="5" or
higher. Visual inspection of e-mail will tell.)

19. Probability of being spam, according to ISU spam filter.  Look for
"Probability=x%" in the header of the email. And record the "x" or an
NA if the message doesn't have a probability. This variable will be
used to compare our classification results from our data. (Has a lot of
missing values, because not everyone read email through the university
mail system.)

20. Extended spam/mail category 
commercial->com,
lists->list, 
newsletter->news, 
ordinary->ord

21. Spam
yes, no

  1. Build a tree model to predict whether the email is spam or ham. Split your data into a 50% training and 50% test set. Build the tree on the training data, and predict the test data. (I have done some initial tuning and settled on minsplit=10 and cp=0.005.)
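
The workflow described here (50/50 split, fit on training, predict both halves) can be sketched with rpart, which ships with standard R installs. This is only an illustration: the data frame below is simulated, and its column names, values and labelling rule are all made up rather than taken from SPAM-503.csv. Only minsplit=10 and cp=0.005 come from the lab.

```r
library(rpart)  # recommended package, shipped with standard R installs

set.seed(2017)

# Simulated stand-in for SPAM-503.csv; columns and values are hypothetical
n <- 400
toy <- data.frame(
  digits = rpois(n, 2),
  size_kb = round(rexp(n, 1 / 10), 1),
  sucker = factor(sample(c("no", "yes"), n, replace = TRUE,
                         prob = c(0.8, 0.2)))
)
# Hypothetical labelling rule, used only so the tree has something to learn
toy$spam <- factor(ifelse(toy$sucker == "yes" | toy$digits > 4, "yes", "no"))

# 50% training / 50% test split
train_rows <- sample(n, size = n / 2)
tr <- toy[train_rows, ]
ts <- toy[-train_rows, ]

# Fit on the training half with the tuning values quoted in the lab
fit <- rpart(spam ~ digits + size_kb + sucker, data = tr, method = "class",
             control = rpart.control(minsplit = 10, cp = 0.005))

# Predicted classes, cross-tabulated against the true labels (rows = truth)
tr_pred <- predict(fit, tr, type = "class")
ts_pred <- predict(fit, ts, type = "class")
table(tr$spam, tr_pred)
table(ts$spam, ts_pred)
```

For the real exercise, replace the simulated data frame with the contents of SPAM-503.csv and put the lab's input variables in the formula.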

  1. (4pts) Compute the proportion of false positives (ham being predicted to be spam), and false negatives (spam predicted to be ham), in your training and test data. Which is the worse error here?
     tr_pred
       no yes
  no  716  15
  yes  15 340
     ts_pred
       no yes
  no  703  27
  yes  17 338

For the training data, the false positive rate is 15/731=0.021 and the false negative rate is 15/355=0.042. On the test set, the false positive rate is 27/730=0.037 and the false negative rate is 17/355=0.048. False positives are the worse error, because these are real emails being sent to the spam folder.
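
These rates can be checked directly from the two tables. The base-R sketch below re-enters the counts by hand, with rows as the true label and columns as the prediction; error_rates is a made-up helper name.

```r
# Confusion tables copied from the output above (rows = truth, cols = predicted)
tr_tab <- matrix(c(716, 15,
                    15, 340), nrow = 2, byrow = TRUE,
                 dimnames = list(truth = c("no", "yes"), pred = c("no", "yes")))
ts_tab <- matrix(c(703, 27,
                    17, 338), nrow = 2, byrow = TRUE,
                 dimnames = list(truth = c("no", "yes"), pred = c("no", "yes")))

# Hypothetical helper: error rates are proportions within each true class
error_rates <- function(tab) {
  c(false_positive = tab["no", "yes"] / sum(tab["no", ]),   # ham sent to spam
    false_negative = tab["yes", "no"] / sum(tab["yes", ]))  # spam let through
}

train_rates <- error_rates(tr_tab)
test_rates <- error_rates(ts_tab)
round(rbind(train = train_rates, test = test_rates), 3)
```

Both error rates are slightly higher on the test set, as expected for data the tree has not seen.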

  1. (3pts) What combinations of variables tend to suggest the email is spam?

(1) Emails from Category com.
(2) Emails from any other Category that have a sucker keyword in the Subject.
(3) Emails from any other Category, without a sucker keyword in the Subject, with more than 5.5 digits in the sender’s name.
(4) Emails from any other Category, without a sucker keyword in the Subject, with fewer than 5.5 digits in the sender’s name, from a Domain com or net, not in the user’s inbox, and larger than 9.5kb.
(5) Emails from any other Category, without a sucker keyword in the Subject, with fewer than 5.5 digits in the sender’s name, from a Domain com or net, not in the user’s inbox, smaller than 9.5kb, but arriving on a weekend.
(6) Emails from any other Category, without a sucker keyword in the Subject, with fewer than 5.5 digits in the sender’s name, from a Domain com or net, not in the user’s inbox, smaller than 9.5kb but larger than 1.5kb.
(7) Emails from any other Category, without a sucker keyword in the Subject, with fewer than 5.5 digits in the sender’s name, from a Domain com or net, not in the user’s inbox, smaller than 1.5kb, but where the sender has no name.